dom4j: Filtering Illegal Characters


qingcaolin


Category: java. Tags: exception string character xml nested class

Article link: https://blog.csdn.net/qingcaolin/article/details/7199233


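The XML (1.0) specification treats the following characters as illegal: 0x00 - 0x08, 0x0b - 0x0c, and 0x0e - 0x1f. All three ranges are non-printable low ASCII control characters. When dom4j encounters one of them while parsing, it throws an exception such as:

Nested exception: org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1) was found in the element content of the document.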
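The three ranges can be captured in a small predicate, which makes it easy to check which characters would trigger the exception. This is only an illustrative sketch: the class name `XmlLowChars` and the method `isIllegalControlChar` are hypothetical and are not part of dom4j or the JDK.

```java
/** Hypothetical helper: classifies the low control characters listed above. */
final class XmlLowChars {
    private XmlLowChars() {}

    /** Returns true if c lies in 0x00-0x08, 0x0b-0x0c or 0x0e-0x1f. */
    static boolean isIllegalControlChar(int c) {
        return (c >= 0x00 && c <= 0x08)
            || c == 0x0b || c == 0x0c
            || (c >= 0x0e && c <= 0x1f);
    }

    public static void main(String[] args) {
        System.out.println(isIllegalControlChar(0x01)); // true
        System.out.println(isIllegalControlChar('\t')); // false: TAB (0x09) is allowed
        System.out.println(isIllegalControlChar('A'));  // false
    }
}
```

TAB (0x09), LF (0x0a) and CR (0x0d) fall into the gaps between the ranges, which is why the code later in this post lets them through.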

Solution:

The test main class (note that FilterInputStreamReader is a wrapper class):

package xu.dom4j;

import java.io.FileInputStream;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

import xu.java.io.FilterInputStreamReader;

public class Test
{
    public static void main(String[] args) throws Exception {
        String filePath = "";            // set this to the path of the XML file to parse
        String fileEncoding = "utf-8";
        Element e = parseFileAndFilter(filePath, fileEncoding);
        System.out.println(e);
    }

    /** Parses the file through FilterInputStreamReader and returns the root element. */
    public static Element parseFileAndFilter(String filePath, String fileEncoding) throws Exception {
        SAXReader reader = new SAXReader();
        reader.setEncoding(fileEncoding);
        // try-with-resources closes the underlying file stream once parsing is done
        try (FilterInputStreamReader in =
                new FilterInputStreamReader(new FileInputStream(filePath), fileEncoding)) {
            Document doc = reader.read(in);
            return doc.getRootElement();
        }
    }
}

The FilterInputStreamReader wrapper class, which extends InputStreamReader:

package xu.java.io; // must match the import used by the Test class

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
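/**
 * This class extends InputStreamReader and works around illegal characters
 * encountered when dom4j parses XML.<br/>
 * The character ranges the XML specification treats as illegal are:
 *  [0x00 - 0x08],
 *  [0x0b - 0x0c],
 *  [0x0e - 0x1f],
 *  all of which are non-printable low ASCII control characters.
 * <br/>
 * How it works: SAXReader's read() method takes a Reader, and internally the content is
 * pulled through that Reader's read(char cbuf[], int offset, int length) method.
 * This class therefore overrides the read methods and handles illegal characters there.
 * Strictly speaking it replaces them (with a space) rather than filtering them out.
 * Replacement can cause problems of its own: for example, an illegal character in the
 * middle of a name becomes a space and may lead to a parse error. True filtering, however,
 * makes overriding read(char cbuf[], int offset, int length) harder, because the characters
 * after a removed one have to be shifted. Option 1: build it on read().
 * Option 2: reduce the number of characters read, i.e. have read return a smaller count.
 *
 *<br/>
 * To be revisited if problems show up later.
 *
 * @author xujg
 * @date 2012-1-13
 *
 */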
public class FilterInputStreamReader extends InputStreamReader
{
	private static final int replaceChar = 0x20; // space

	public FilterInputStreamReader(InputStream in, String charsetName) throws UnsupportedEncodingException
	{
		super(in,charsetName);
	}

	@Override
	public int read() throws IOException
	{
	    int ch = super.read();
	    if(ch>0x1f)
	        return ch;
	    // keep CR (0x0d), TAB (0x09), LF (0x0a) and the end-of-stream marker (-1)
	    if(ch == 0x0d || ch == -1 || (ch>0x08 && ch<0x0b))
	        return ch;
	    return replaceChar;
	}

	@Override
	public int read(char cbuf[], int offset, int length) throws IOException {
	    int count= super.read(cbuf, offset, length);
	    // the characters just read occupy cbuf[offset .. offset+count-1];
	    // count is -1 at end of stream, in which case the loop does not run
	    for( int i = offset;i<offset+count;i++){
	        if(cbuf[i]>0x1f)
	            continue;
	        if(cbuf[i] == 0x0d || (cbuf[i]>0x08 && cbuf[i]<0x0b))
	            continue;
	        cbuf[i]= replaceChar;
	    }
	    return count;
	}
}
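The same replacement idea can also be applied one level lower, on the InputStream itself. The following is a minimal sketch, not part of the original solution: the class name `ControlByteReplacingInputStream` and the helper `cleanUtf8` are hypothetical. It assumes the input is UTF-8 (or another ASCII-compatible encoding), where the bytes 0x00 - 0x1f only ever occur as the control characters themselves; for UTF-16 a byte-level replacement like this would corrupt ordinary text. `readAllBytes` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical byte-level counterpart of the Reader wrapper (UTF-8 input assumed). */
final class ControlByteReplacingInputStream extends FilterInputStream {
    private static final byte REPLACEMENT = 0x20; // space

    ControlByteReplacingInputStream(InputStream in) {
        super(in);
    }

    /** True for 0x00-0x1f except TAB, LF and CR; false for -1 and negative (>= 0x80) bytes. */
    private static boolean isIllegal(int b) {
        return b >= 0x00 && b <= 0x1f && b != 0x09 && b != 0x0a && b != 0x0d;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return isIllegal(b) ? REPLACEMENT : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, so the loop body is skipped in that case
        for (int i = off; i < off + n; i++) {
            if (isIllegal(buf[i])) {
                buf[i] = REPLACEMENT;
            }
        }
        return n;
    }

    /** Demo helper: runs raw UTF-8 bytes through the wrapper and decodes the result. */
    static String cleanUtf8(byte[] raw) {
        try (InputStream in = new ControlByteReplacingInputStream(new ByteArrayInputStream(raw))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "<a>x\u0001y</a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(cleanUtf8(raw)); // <a>x y</a>
    }
}
```

Because one byte is swapped for one byte, the length of the stream does not change, so the wrapper never has to adjust the returned count.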

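Notes:

1. SAXReader obtains its content through read(char cbuf[], int offset, int length), so changing read() alone has no effect on parsing.

2. For an InputStream the approach should be similar: wrap the corresponding read methods.

3. Strictly speaking, this class is not a filter; it only replaces the special characters with spaces.

4. For performance reasons the check is written inline instead of being extracted into its own method (it is called very often, and the extra call overhead would affect performance).

5. Real filtering requires handling the two methods separately; an implementation of each follows.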

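read() with filtering: when an illegal character is encountered, keep reading until the next legal one. A loop is used so that a long run of illegal characters cannot exhaust the stack:

	public int read() throws IOException
	{
	    while (true) {
	        int ch = super.read();
	        if(ch>0x1f)
	            return ch;
	        // keep CR (0x0d), TAB (0x09), LF (0x0a) and the end-of-stream marker (-1)
	        if(ch == 0x0d || ch == -1 || (ch>0x08 && ch<0x0b))
	            return ch;
	        // illegal character: drop it and read the next one
	    }
	}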

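read(char cbuf[], int offset, int length): filtering here means shifting the remaining content forward and then reducing the returned count. (The code below has not been tested.)

// SAXReader actually calls this read method to obtain content
	public int read(char cbuf[], int offset, int length) throws IOException {
	    while (true) {
	        int count = super.read(cbuf, offset, length);
	        if (count <= 0)
	            return count; // -1 means end of stream
	        int invalidNum = 0;
	        // the characters just read occupy cbuf[offset .. offset+count-1]
	        for (int i = offset; i < offset + count; i++) {
	            char c = cbuf[i];
	            if (c > 0x1f || c == 0x0d || (c > 0x08 && c < 0x0b)) {
	                // legal character: move it forward over the removed ones
	                cbuf[i - invalidNum] = c;
	            } else {
	                // one more illegal character dropped
	                invalidNum++;
	            }
	        }
	        // return the reduced count; if the whole chunk was illegal,
	        // read again instead of returning 0
	        if (count > invalidNum)
	            return count - invalidNum;
	    }
	}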
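The shifting step is easy to get wrong when `offset` is not zero, so it may help to look at it in isolation. The sketch below is a standalone illustration under that framing: `BufferCompactor` and `compact` are hypothetical names, and the method only operates on an array, without any stream involved.

```java
/** Hypothetical standalone version of the "shift forward and shrink the count" step. */
final class BufferCompactor {
    private BufferCompactor() {}

    /** True for 0x00-0x1f except TAB (0x09), LF (0x0a) and CR (0x0d). */
    static boolean isIllegal(char c) {
        return c <= 0x1f && c != 0x09 && c != 0x0a && c != 0x0d;
    }

    /**
     * Removes illegal characters from buf[offset .. offset+count-1] by moving the
     * legal ones forward, and returns how many characters remain in that region.
     */
    static int compact(char[] buf, int offset, int count) {
        int write = offset;
        for (int read = offset; read < offset + count; read++) {
            char c = buf[read];
            if (!isIllegal(c)) {
                buf[write++] = c;
            }
        }
        return write - offset;
    }

    public static void main(String[] args) {
        // the first two characters lie outside the region and must stay untouched
        char[] buf = "##a\u0001b\u0002c".toCharArray();
        int kept = compact(buf, 2, 5);
        System.out.println(kept);                     // 3
        System.out.println(new String(buf, 2, kept)); // abc
    }
}
```

Using a separate write index is equivalent to subtracting a running count of removed characters from the read index; both keep the surviving characters contiguous starting at `offset`.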

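Note: this is my first post containing code. Corrections are welcome, and so is discussion.

If you repost this article, please credit the source. Thanks!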
